{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "a02af716-448a-4310-8c1c-2107863c8eea",
   "metadata": {},
   "source": [
    "# Pathway analysis\n",
    "\n",
    "Pathway analysis methods provide meaningful insights to better understand how differentially expressed genes (DEG) impact biological processes.\n",
    "Let's see how we can run **Over-Representation Analysis (ORA)** and **Gene Set Enrichment Analysis (GSEA)** on our data, using gseapy package. As the returned objects are pretty big, we delete them at the end of each section for the notebook to save memory.\n",
    "\n",
    "For more details about the functions used here, you can refer to [gseapy documentation](https://gseapy.readthedocs.io), and for more information about GSEA in general (input data etc) : [gsea-msigdb website](https://www.gsea-msigdb.org/gsea/doc/GSEAUserGuideFrame.html)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "de19a5b2-03ae-46c4-b6c3-251a71877597",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pylluminator.utils import load_object\n",
    "from pylluminator.dm import combine_p_values_stouffer\n",
    "from pylluminator.utils import set_logger\n",
    "from pylluminator.visualizations import visualize_gene\n",
    "\n",
    "import numpy as np\n",
    "import gseapy as gp\n",
    "import networkx as nx\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "set_logger('WARNING')  # set the verbosity level, can be DEBUG, INFO, WARNING, ERROR"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a88ca66c-c818-4b0f-9f63-f706b3f4cc96",
   "metadata": {},
   "source": [
    "## Data preparation\n",
    "\n",
    "To run ORA and Pre-rank methods, we need to load DMPs or DMRs. If you haven't done it yet, check out notebook `3 - DMPs and DMRs`. \n",
    "\n",
    "For GSEA method, we use samples data directly - we use the samples stored in your DM object."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "593660e9-abae-4095-9522-2301cfb4c8ef",
   "metadata": {},
   "outputs": [],
   "source": [
    "my_dms = load_object('dms')  # load a DM object"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a07b5b29-1a8a-4f5d-ab5c-6a3cd40ea08c",
   "metadata": {},
   "outputs": [],
   "source": [
    "# get the genes associated with each probe\n",
    "annotation_colname = 'genes'\n",
    "gene_info = my_dms.samples.annotation.probe_infos[['probe_id', annotation_colname]].drop_duplicates().dropna()\n",
    "# if some probes are associated to several genes, make it one row per gene. Make sure genes are upper case for GSEApy compatibilty\n",
    "gene_info[annotation_colname] = gene_info[annotation_colname].apply(lambda x: x.upper().split(';'))\n",
    "gene_info = gene_info.explode(annotation_colname).drop_duplicates()\n",
    "gene_info.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f274e812",
   "metadata": {},
   "source": [
    "For each gene, we use the DMPs or the DMRs to compute the fold change (average beta difference between the two conditions) and the significance (by combining the p-values of all associated probes using Stouffer's method)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0d03db85",
   "metadata": {},
   "outputs": [],
   "source": [
    "# chose whether to use DMPs or DMRs\n",
    "input_type = 'DMP' \n",
    "\n",
    "# if we work on DMRs, add the probe_ids to the DMRs - for DMPs, just use the DMPs dataframe directly\n",
    "dm_df = my_dms.dmr.join(my_dms.segments.reset_index().set_index('segment_id')) if input_type == 'DMR' else my_dms.dmp \n",
    "# add the gene information \n",
    "dm_df = dm_df.merge(gene_info, on='probe_id')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "69e73639",
   "metadata": {},
   "outputs": [],
   "source": [
    "# set the columns to use to select or rank genes (you can check available columns with dm_df.columns)\n",
    "significance_colname = 'sample_type[T.PREC]_p_value_adjusted'\n",
    "fold_change_colname = 'avg_beta_delta_sample_type_LNCAP_vs_PREC'\n",
    "# set the column of the sample sheet that define the type of sample\n",
    "class_colname = 'sample_type'\n",
    "\n",
    "# keep only useful columns and remove NAs\n",
    "dm_df = dm_df[[annotation_colname, fold_change_colname, significance_colname]].dropna()\n",
    "# aggregate values for each gene\n",
    "gene_fc_sig_df = dm_df.groupby(annotation_colname).agg({fold_change_colname: 'mean', significance_colname: combine_p_values_stouffer})\n",
    "\n",
    "gene_fc_sig_df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e5df2629",
   "metadata": {},
   "source": [
    "For GSEA, we compute the mean beta values per gene, for each sample"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "53e1e877-c0f4-43e9-9a00-22b7780f4498",
   "metadata": {},
   "outputs": [],
   "source": [
    "probes_betas_df = my_dms.samples.get_betas().reset_index().set_index('probe_id').drop(columns=['type', 'channel', 'probe_type'])\n",
    "genes_betas_df = probes_betas_df.join(gene_info.set_index('probe_id'), how='right').groupby(annotation_colname).mean()\n",
    "\n",
    "# get the list of sample types, ordered like the beta dataframe\n",
    "sample_info = my_dms.samples.sample_sheet.set_index(my_dms.samples.sample_label_name)\n",
    "sample_types = sample_info.loc[probes_betas_df.columns, class_colname]\n",
    "\n",
    "genes_betas_df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e0710732",
   "metadata": {},
   "outputs": [],
   "source": [
    "del dm_df\n",
    "del probes_betas_df\n",
    "del gene_info"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "88d0e26f-9ef6-45cf-99c9-7704a74d291a",
   "metadata": {},
   "source": [
    "### Gene set selection\n",
    "\n",
    "You will first need to select the gene set(s) to use in you pathway analysis. You can browse [enrichr website](https://maayanlab.cloud/Enrichr/#libraries) or use the `gp.get_library_name()` function to list available libraries"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "61a31ea3-8988-4597-9e12-cacdb452a7e7",
   "metadata": {},
   "outputs": [],
   "source": [
    "gene_sets = ['GO_Biological_Process_2025', 'KEGG_2021_Human']"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "245a0cc1-707c-492c-8108-82d32047509c",
   "metadata": {},
   "source": [
    "### Organism specification\n",
    "\n",
    "Specify the organism you're working on: Human, Mouse, Yeast, Fly, Fish or Worm"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "317b2999-67f3-4ed9-a3d5-16e91e47b32f",
   "metadata": {},
   "outputs": [],
   "source": [
    "organism = 'human'"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2e0ebdaa-e2bd-4bfa-93fa-5b0f43f29c77",
   "metadata": {},
   "source": [
    "## Over-Representation Analysis"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "81810c2d",
   "metadata": {},
   "source": [
    "### ORA without background\n",
    "\n",
    "The minimal input you need to run an ORA is a list of differentially expressed genes (DEG) from your dataset. Here we use the threshold of `p-value < 0.05` and `abs(fold-change) > 0.2`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9ea62e92-01b8-4722-848b-120d77fa0e99",
   "metadata": {},
   "outputs": [],
   "source": [
    "deg = list(set(gene_fc_sig_df[(gene_fc_sig_df[significance_colname] < 0.05) & (abs(gene_fc_sig_df[fold_change_colname]) > 0.2)].index.values))\n",
    "print(f'Number of genes selected: {len(deg)}/{len(set(gene_fc_sig_df.index))}\\n')\n",
    "\n",
    "enr = gp.enrichr(gene_list=deg, gene_sets=gene_sets,  organism=organism) \n",
    "\n",
    "# output most significant results \n",
    "enr.results.sort_values('Adjusted P-value').head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ea23cae8-d71c-4739-a99a-5dc13d73e1aa",
   "metadata": {},
   "outputs": [],
   "source": [
    "ax = gp.dotplot(enr.results, x='Gene_set', size=3, top_term=5, title='ORA without background', xticklabels_rot=30, show_ring=True, figsize=(5,5))\n",
    "# use smaller fonts\n",
    "for item in ax.get_xticklabels() + ax.get_yticklabels():\n",
    "    item.set_fontsize(8)\n",
    "ax.xaxis.label.set_size(10)\n",
    "ax.title.set_size(15)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "be158175-bbe7-4993-b802-1a7bc076cc4a",
   "metadata": {},
   "source": [
    "### ORA with background\n",
    "\n",
    "By default, selected genes are tested against all genes available in the gene sets. For better results, it can be relevant to narrow it down to a list of genes of interest. \n",
    "\n",
    "Let's use the genes that were detected in our experiment as background genes, and see how it changes the result."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "46df2eb7-f97c-40cc-9567-1c747d4c19ad",
   "metadata": {},
   "outputs": [],
   "source": [
    "background_genes = list(set(gene_fc_sig_df.index))\n",
    "print('Number of selected genes: ', len(deg))\n",
    "print('Number of background genes: ', len(background_genes), '\\n')\n",
    "\n",
    "enr_bg = gp.enrichr(gene_list=deg, gene_sets=gene_sets, organism=organism, background=background_genes) \n",
    "# output top 5 results results \n",
    "enr_bg.results.sort_values('Adjusted P-value')[:5]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "054e2342",
   "metadata": {},
   "source": [
    "Visualize the top 5 terms for each gene set, with the default significance cutoff set at 0.05"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "881a93ff-2583-4c56-8b56-e365275c4e18",
   "metadata": {},
   "outputs": [],
   "source": [
    "ax = gp.dotplot(enr_bg.results, x='Gene_set', size=2, top_term=5, figsize=(5, 5), y_order=True, title = 'ORA with background', xticklabels_rot=30)\n",
    "# use smaller fonts\n",
    "for item in ax.get_xticklabels() + ax.get_yticklabels():\n",
    "    item.set_fontsize(8)\n",
    "ax.xaxis.label.set_size(10)\n",
    "ax.title.set_size(15)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "476433c5",
   "metadata": {},
   "outputs": [],
   "source": [
    "del deg\n",
    "del enr\n",
    "del enr_bg"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4f1a0d49-e2d4-49dc-97ae-65c97e9411a5",
   "metadata": {},
   "source": [
    "## Pre-rank GSEA\n",
    "\n",
    "As we have already calculate DMPs an DMRs, we can use them to rank the genes and use the pre-rank method provided by gseapy. \n",
    "\n",
    "Here we use the formula `sign(FC) * - log10(significance)` to rank the genes, that bring the most upregulated genes at the beginning of the list, the most downregulated at the end, and the less differentialy expressed genes in the middle."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2e455a0d-61e9-4a2a-9b29-db16d9c904a8",
   "metadata": {},
   "outputs": [],
   "source": [
    "def rank_formula(row):\n",
    "    if row[significance_colname] == 0: \n",
    "        return np.sign(row[fold_change_colname]) * np.inf\n",
    "    return np.sign(row[fold_change_colname]) * -np.log10(row[significance_colname])\n",
    "\n",
    "rank_data = gene_fc_sig_df.apply(rank_formula, axis=1).sort_values(ascending=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b1f13d77-1f7d-4abb-8c9a-9eb9daac10ff",
   "metadata": {},
   "outputs": [],
   "source": [
    "# we chose a low permutation number to speed up the demo\n",
    "pre_res = gp.prerank(rnk=rank_data, gene_sets=gene_sets, verbose=True, permutation_num=1000, max_size=300, threads=4)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9f30af6f",
   "metadata": {},
   "source": [
    "Order the pathways by their False Discovery Rate (FDR), and show the ones that have a FDR lower than 0.05"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bcb69440-0548-453d-8319-a71c713dbdbb",
   "metadata": {},
   "outputs": [],
   "source": [
    "pre_res.res2d = pre_res.res2d.sort_values('FDR q-val').reset_index(drop=True)\n",
    "pre_res.res2d[pre_res.res2d['FDR q-val'] < 0.05]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ea7f2809",
   "metadata": {},
   "source": [
    "Now we can visualize the most significant term"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f292576e-13d2-49d3-9281-57738d8708e0",
   "metadata": {},
   "outputs": [],
   "source": [
    "term = pre_res.res2d.Term[0]\n",
    "fig = pre_res.plot(terms=term, figsize=(10, 5))\n",
    "# use smaller fonts\n",
    "for ax in fig.axes:\n",
    "    ax.set_xlabel(ax.get_xlabel(), fontsize=10)\n",
    "    ax.set_ylabel(ax.get_ylabel(), fontsize=10)\n",
    "_ = fig.suptitle(term, fontsize=15) "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7f094d1b",
   "metadata": {},
   "source": [
    "Or visualize the top 5 terms on the same plot:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f4123d52-57ca-4688-8d73-8848fc69a96f",
   "metadata": {},
   "outputs": [],
   "source": [
    "fig = pre_res.plot(terms=pre_res.res2d.Term[:5],figsize=(3,4))\n",
    "# use smaller fonts\n",
    "for ax in fig.axes:\n",
    "    ax.set_xlabel(ax.get_xlabel(), fontsize=10)\n",
    "    ax.set_ylabel(ax.get_ylabel(), fontsize=10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0d030e69-b7cf-4e5e-86f5-6a1c71dcdfd3",
   "metadata": {},
   "outputs": [],
   "source": [
    "ax = gp.dotplot(pre_res.res2d, column='FDR q-val', title='Pre-ranked GSEA', cmap=plt.cm.viridis, size=3, figsize=(3, 4), show_ring=True)\n",
    "# use smaller fonts\n",
    "for item in ax.get_xticklabels() + ax.get_yticklabels():\n",
    "    item.set_fontsize(8)\n",
    "ax.xaxis.label.set_size(10)\n",
    "ax.title.set_size(15)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b8faefca",
   "metadata": {},
   "outputs": [],
   "source": [
    "term_idx = 0  # chose the first term\n",
    "genes = pre_res.res2d.Lead_genes[term_idx].split(\";\")\n",
    "ax = gp.heatmap(df = genes_betas_df.loc[genes], z_score=0, title=pre_res.res2d.Term[term_idx], figsize=(7,6), yticklabels=False)\n",
    "# use smaller fonts and show all labels\n",
    "genes.reverse()  #for labels to be in the right order\n",
    "ax.title.set_size(15)\n",
    "_ = ax.set_xticks(range(len(genes_betas_df.columns)), labels=genes_betas_df.columns, rotation=30, ha=\"center\",  fontsize=8)\n",
    "_ = ax.set_yticks(range(len(genes)), labels=genes, va=\"bottom\", fontsize=8)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "13f676c3",
   "metadata": {},
   "source": [
    "Using pylluminator's function, we can go back to checking the beta values of the probes associated to a given gene"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "37d03e04",
   "metadata": {},
   "outputs": [],
   "source": [
    "visualize_gene(my_dms.samples, 'HOXD3', figsize=(20, 6))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3b0893bf",
   "metadata": {},
   "source": [
    "Finally, compute the enrichment map to show the relations between the detected relevant pathways"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "744b8322",
   "metadata": {},
   "outputs": [],
   "source": [
    "nodes, edges = gp.enrichment_map(pre_res.res2d, cutoff=0.1)  # chose a higher p-value cutoff to show more nodes, default is 0.05\n",
    "\n",
    "# build graph\n",
    "G = nx.from_pandas_edgelist(edges,\n",
    "                            source='src_idx',\n",
    "                            target='targ_idx',\n",
    "                            edge_attr=['jaccard_coef', 'overlap_coef', 'overlap_genes'])\n",
    "\n",
    "nodes = nodes.loc[G.nodes.keys()]  # remove nodes that are not connected by any edge\n",
    "\n",
    "fig, ax = plt.subplots(figsize=(10, 6), layout='constrained')\n",
    "\n",
    "# init node coordinates\n",
    "pos=nx.layout.bfs_layout(G, start=nodes.index[0])\n",
    "\n",
    "# add a colorbar\n",
    "nodes_cmap = plt.cm.RdYlBu\n",
    "max_NES = max(abs(min(nodes.NES)), max(nodes.NES))\n",
    "sm = plt.cm.ScalarMappable(cmap=nodes_cmap, norm=plt.Normalize(vmin=-max_NES, vmax=max_NES))\n",
    "sm.set_array([])  \n",
    "cbar = plt.colorbar(sm, ax=ax, shrink=0.25)\n",
    "cbar.set_label('NES (Normalized Enrichment Score)', rotation=270, labelpad=15)\n",
    "\n",
    "# draw nodes\n",
    "nx.draw_networkx_nodes(G, pos=pos,\n",
    "                       cmap=nodes_cmap, node_color=nodes.NES, vmin=-max_NES, vmax=max_NES,\n",
    "                       margins=0.3, node_size=list(nodes.Hits_ratio *1000))\n",
    "# draw node labels - wrap labels so that they don't overlap\n",
    "max_length = 35\n",
    "wrapped_labels = {k: \"\\n\".join([v[i:i+max_length] for i in range(0, len(v), max_length)]) for k, v in nodes.Term.to_dict().items()}\n",
    "nx.draw_networkx_labels(G, pos=pos, font_size= 6, labels=wrapped_labels)\n",
    "\n",
    "# draw edges\n",
    "edge_weight = nx.get_edge_attributes(G, 'jaccard_coef').values()\n",
    "nx.draw_networkx_edges(G, pos=pos, width=list(map(lambda x: x*10, edge_weight)))\n",
    "\n",
    "plt.axis('off')\n",
    "plt.show()\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1a6c9f37",
   "metadata": {},
   "outputs": [],
   "source": [
    "del gene_fc_sig_df\n",
    "del pre_res\n",
    "del my_dms"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5e12275e",
   "metadata": {},
   "source": [
    "## GSEA"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cfa2bb6e",
   "metadata": {},
   "source": [
    "Since we are working with two groups (healthy control cells and prostate cancer cells), we can directly use the GSEA function on our 'raw' data (here, the genes beta values) with the corresponding phenotype dataframe (sample_types), to find the enriched pathways. Uncomment the following code to run the analysis."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a59389e2-e255-4e6f-8015-48b57a2a97b9",
   "metadata": {},
   "outputs": [],
   "source": [
    "# gs = gp.GSEA(data=genes_betas_df.dropna(), gene_sets=gene_sets, classes=sample_types, permutation_num=1000)  \n",
    "# gs.pheno_neg = 'PREC'  # control samples\n",
    "# gs.pheno_pos = 'LNCAP'  # samples of prostate cancer cells\n",
    "# gs.run()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "206f5a4e",
   "metadata": {},
   "source": [
    "Visualize the top 5 identified pathways"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fbdbc115-01d1-4e80-a452-011319a351d8",
   "metadata": {},
   "outputs": [],
   "source": [
    "# _ = gs.plot(gs.res2d.Term[:5])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a69f08ed",
   "metadata": {},
   "source": [
    "Visualize, for a given pathway and for each gene, the samples z-score."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1d4855f7-9f18-4d43-8c66-6e01d3766f80",
   "metadata": {},
   "outputs": [],
   "source": [
    "# term_idx = 0  # chose the first term\n",
    "# genes = gs.res2d.Lead_genes[term_idx].split(\";\")\n",
    "# ax = gp.heatmap(df = genes_betas_df.loc[genes], z_score=0, title=gs.res2d.Term[term_idx], figsize=(8,4))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "pylluminator",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.13.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}